Abstract

The long-standing problem of Shannon entropy estimation in data streams (assuming the strict Turnstile model) is now an easy task by using the technique proposed in this paper. Essentially speaking, in order to estimate the Shannon entropy with a guaranteed $\nu$-additive accuracy, it suffices to estimate the $\alpha$th frequency moment, where $\alpha = 1 - \Delta$, with a guaranteed $\epsilon$-multiplicative accuracy, where $\epsilon = \nu\Delta$. Previous studies have shown that $\Delta$ has to be extremely small (e.g., $\Delta < 10^{-4}$ or even much smaller). In other words, the sample complexity for entropy estimation is $O(V/\epsilon^2) = O(V/(\nu^2\Delta^2))$, where $V$ is the coefficient essentially determined by the variance of the estimator of frequency moments. In this paper, the proposed algorithm achieves $V = O(\Delta^2)$ and hence is a practical technique with complexity $O(1/\nu^2)$ (which is essentially $O(1)$ if we consider $\nu = O(1)$). We provide the (small) complexity bound constants numerically (for $0 < \nu < 1$) and analytically (for small $\nu$).

Prior well-known algorithms based on symmetric stable random projections could only achieve $V = O(1)$, meaning that the sample complexity would be $O(1/(\nu^2\Delta^2))$, which will be extremely large. For example, if $\nu = O(1)$ and $\Delta = 10^{-5}$, then $O(1/(\nu^2\Delta^2)) = O(10^{10})$.

Compressed Counting (CC) [27], based on maximally skewed stable random projections, was recently proposed for estimating the $\alpha$th frequency moment of data streams. [27] proposed algorithms for CC based on the geometric mean and harmonic mean estimators. It was proved that the geometric mean estimator could achieve $V = O(\Delta)$, leading to an $O(1/(\nu^2\Delta))$ algorithm, which unfortunately could still be impractical. In this paper, we prove that the harmonic mean estimator for CC also could only achieve $V = O(\Delta)$.



The proposed new estimator for CC has a simple clean form:

$$\hat{F}_{(\alpha)} = \frac{1}{\Delta^{\Delta}} \left[ \frac{k}{\sum_{j=1}^{k} x_j^{-\alpha/\Delta}} \right]^{\Delta},$$

where the $x_j$'s are the projected data and $k$ is the sample size. We prove that its variance achieves $V = O(\Delta^2)$, leading to a practical algorithm with complexity $O(1/\nu^2)$. In other words, if $\Delta = 10^{-5}$, the new algorithm improves the prior algorithms based on symmetric stable random projections roughly by a factor of $1/\Delta^2 = 10^{10}$; and it improves the geometric/harmonic mean algorithms for CC roughly by a factor of $1/\Delta = 10^{5}$.

Our extensive experiments (in the Appendix) verify that, using the proposed algorithm, $k \approx 10$ samples
could provide accurate estimates of the Shannon entropy. The proposed algorithm is also numerically very 
stable, even for A as small as 10"^". 



1 Introduction 

The problem of "scaling up for high dimensional data and high speed data streams" is among the "ten challenging problems in data mining research" [39]. This paper is devoted to estimating the entropy of data streams. Mining data streams [20, 4, 1, 32] in (e.g.,) 100 TB scale databases has become an important area of research, e.g., [10, 1], as network data can easily reach that scale [39]. Search engines are a typical source of data streams [4].
Consider the Turnstile stream model [32]. The input stream $a_t = (i_t, I_t)$, $i_t \in [1, D]$, arriving sequentially describes the underlying signal $A$, meaning

$$A_t[i_t] = A_{t-1}[i_t] + I_t, \qquad (1)$$

where the increment $I_t$ can be either positive (insertion) or negative (deletion). Restricting $A_t[i] \geq 0$ results in the strict-Turnstile model, which suffices for describing almost all natural phenomena [32].

This study focuses on the relaxed strict-Turnstile model and studies efficient algorithms for estimating the $\alpha$th frequency moment of data streams

$$F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}. \qquad (2)$$

We are particularly interested in the case of $\alpha \to 1$, which is very important for estimating Shannon entropy.

The relaxed strict-Turnstile model only requires $A_t[i] \geq 0$ at the time $t$ one cares about (e.g., the end of streams); and hence it is considerably more flexible than the strict-Turnstile model.



1.1 Entropy, Moments, and Estimation Complexity 

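A very useful (e.g., in Web and networks [12, 25, 40, 30] and neural computations [33]) summary statistic is the Shannon entropy

$$H = -\sum_{i=1}^{D} \frac{A_t[i]}{F_{(1)}} \log \frac{A_t[i]}{F_{(1)}}, \qquad F_{(1)} = \sum_{i=1}^{D} A_t[i]. \qquad (3)$$

Various generalizations of the Shannon entropy have been proposed. The Rényi entropy [34], denoted by $H_\alpha$, and the Tsallis entropy [19, 36], denoted by $T_\alpha$, are respectively defined as

$$H_\alpha = \frac{1}{1-\alpha} \log \frac{F_{(\alpha)}}{F_{(1)}^{\alpha}}, \qquad (4)$$

$$T_\alpha = \frac{1}{\alpha-1} \left( 1 - \frac{F_{(\alpha)}}{F_{(1)}^{\alpha}} \right). \qquad (5)$$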

As $\alpha \to 1$, both Rényi entropy and Tsallis entropy converge to Shannon entropy: $\lim_{\alpha\to1} H_\alpha = \lim_{\alpha\to1} T_\alpha = H$. Thus, both Rényi entropy and Tsallis entropy can be computed from the $\alpha$th frequency moment; and one can approximate Shannon entropy from either $H_\alpha$ or $T_\alpha$ by letting $\alpha \approx 1$. Several studies ([40, 18, 17]) used this idea to approximate Shannon entropy, all of which relied critically on efficient algorithms for estimating the $\alpha$th
frequency moments (|2|i near a = 1. In fact, one can numerically verify that the a values proposed in 1 18 17J are 
extremely close to 1, for example, A = |1 - a| < 10"'' |[T8] Alg. 1] or A < lO"^!!?) are quite likelyQ 



From the definition of the Rényi and Tsallis entropies, it is clear that, in order to achieve a $\nu$-additive guarantee for the Shannon entropy, it suffices to estimate the $\alpha$th frequency moment with an $\epsilon = \nu\Delta$ guarantee (for sufficiently small $\Delta$). For example, suppose an estimator $\hat{F}_{(\alpha)}$ guarantees (with high probability) that $(1-\epsilon)F_{(\alpha)} \leq \hat{F}_{(\alpha)} \leq (1+\epsilon)F_{(\alpha)}$; then the estimated Rényi entropy, denoted by $\hat{H}_\alpha$, would satisfy $H_\alpha - \nu \leq \hat{H}_\alpha \leq H_\alpha + \nu$, assuming $\Delta$ is sufficiently small. This is because $\frac{1}{\Delta}\log(1 \pm \nu\Delta) \approx \pm\nu$.

Another perspective is from the estimation variances. From the definitions of the Rényi and Tsallis entropies, it is clear that we need estimators of the frequency moments with variances proportional to $O(\Delta^2)$ in order to cancel the term $\frac{1}{\Delta^2}$. The estimation variance, of course, is also closely related to the sample complexity.
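The relationship between the three entropies and the frequency moment can be checked numerically. The sketch below is only an illustration under stated assumptions: the `counts` vector is a hypothetical final state $A_t$ of a tiny stream, the moments are computed exactly (no sketching), and all function names are ours rather than the paper's.

```python
import math

def frequency_moment(counts, alpha):
    """F_(alpha) = sum_i A[i]^alpha over the nonzero entries, as in (2)."""
    return sum(c ** alpha for c in counts if c > 0)

def shannon_entropy(counts):
    f1 = sum(counts)
    return -sum((c / f1) * math.log(c / f1) for c in counts if c > 0)

def renyi_entropy(counts, alpha):
    f1 = sum(counts)
    return math.log(frequency_moment(counts, alpha) / f1 ** alpha) / (1.0 - alpha)

def tsallis_entropy(counts, alpha):
    f1 = sum(counts)
    return (1.0 - frequency_moment(counts, alpha) / f1 ** alpha) / (alpha - 1.0)

counts = [50, 30, 15, 5]  # hypothetical final counts A_t[i]
H = shannon_entropy(counts)
for delta in (1e-1, 1e-2, 1e-3, 1e-4):
    alpha = 1.0 - delta
    # Both gaps shrink roughly linearly in delta.
    print(delta, renyi_entropy(counts, alpha) - H, tsallis_entropy(counts, alpha) - H)

# A (1 + nu*delta) multiplicative error in F_(alpha) shifts the Renyi
# entropy by log(1 + nu*delta)/delta, which is close to nu.
nu, delta = 0.1, 1e-4
entropy_shift = math.log1p(nu * delta) / delta
print(entropy_shift)
```

The last two lines illustrate why an $\epsilon = \nu\Delta$ multiplicative guarantee on the moment translates into a $\nu$-additive guarantee on the entropy.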



' In 1 18 Alg. 1], A = ig iog(i/c) ' ~ 4log(£i') log(m) • where m is the number of streaming updates. If we let D = 2^"^, m = 2^*, 
u = 0.01, then A ^ 6 X 10~^. If we let m = 10"^, = 0.1, then A 2 X 10"^. 

1171 Sec. 4.2.2 and 5.2] provides some improvements, to allow larger A. If m = 2®'' and u = 0.01, then A 7 X 10^®. If m = 10'' and 
u = 0.1, then A lO""*. 
Suppose we have an unbiased estimator of $F_{(\alpha)}$ whose variance is $\frac{V}{k}F_{(\alpha)}^2$, where $k$ is the sample size. Then the sample complexity is essentially $O\left(V F_{(\alpha)}^2 / (\epsilon^2 F_{(\alpha)}^2)\right) = O(V/\epsilon^2)$, using the standard argument popular in the theory literature, e.g., [23]. The space complexity (in terms of bits) will be $O\left(V/\epsilon^2 \log \sum_{s=1}^{t} |I_s|\right)$. The drawback of this argument is that it does not fully specify the constants.

In summary, in order to provide a $\nu$ (e.g., 0.1) additive approximation of the Shannon entropy, one should use $O(V/\epsilon^2) = O(V/(\nu^2\Delta^2))$ samples for estimating the $(1 \pm \Delta)$th frequency moments. This bound initially appears disappointing, because, if for example, $V = O(1)$, $\nu = 0.1$, $\Delta = 10^{-5}$, then it requires $O(10^{12})$ samples, which is very likely impractical. Well-known algorithms based on symmetric stable random projections [21, 26] indeed exhibit $V = O(1)$.

1.2 Some Applications of Shannon Entropy 

1.2.1 Real-Time Network Anomaly Detection 

Network traffic is a typical example of high-rate data streams. An effective and reliable measurement of network traffic in real-time is crucial for anomaly detection and network diagnosis; and one such measurement metric is Shannon entropy [12, 24, 38, 7, 25, 40]. The Turnstile data stream model (1) is naturally suitable for describing network traffic, especially when the goal is to characterize the statistical distribution of the traffic. In its empirical form, a statistical distribution is described by histograms, $A_t[i]$, $i = 1$ to $D$. It is possible that $D = 2^{64}$ (IPV6) if one is interested in measuring the traffic streams of unique source or destination.

The Distributed Denial of Service (DDoS) attack is a representative example of network anomalies. A DDoS 
attack attempts to make computers unavailable to intended users, either by forcing users to reset the computers 
or by exhausting the resources of service-hosting sites. For example, hackers may maliciously saturate the victim 
machines by sending many external communication requests. DDoS attacks typically target sites such as banks, 
credit card payment gateways, or military sites. 

A DDoS attack changes the statistical distribution of network traffic. Therefore, a common practice to detect an attack is to monitor the network traffic using certain summary statistics. Since Shannon entropy is well-suited for characterizing a distribution, a popular detection method is to measure the time-history of entropy and raise an alarm when the entropy becomes abnormal [12, 25].

Entropy measurements do not have to be "perfect" for detecting attacks. It is however crucial that the algorithm should be computationally efficient at low memory cost, because the traffic data generated by large high-speed networks are enormous and transient (e.g., 1 Gbits/second). Algorithms should be real-time and one-pass, as the traffic data will not be stored [4]. Many algorithms have been proposed for "sampling" the traffic data and estimating entropy over data streams [25, 40, 6, 16, 3, 8, 18, 17].

1.2.2 Entropy of Query Logs in Web Search 

The recent work [30] was devoted to estimating the Shannon entropy of MSN search logs, to help answer some basic problems in Web search, such as, how big is the web?

The search logs can be viewed as data streams, and [30] analyzed several "snapshots" of a sample of MSN search logs. The sample used in [30] contained 10 million <Query, URL, IP> triples; each triple corresponded to a click from a particular IP address on a particular URL for a particular query. [30] drew their important conclusions on this (hopefully) representative sample. Alternatively, one could apply data stream algorithms such as CC on the whole history of MSN (or other search engines).

1.2.3 Entropy in Neural Computations 

A workshop in NIPS '03 was devoted to entropy estimation, owing to the wide-spread use of Shannon entropy in Neural Computations [33] (http://www.menem.com/~ilya/pages/NIPS03). For example, one application of entropy is to study the underlying structure of spike trains.
1.3 Previous Algorithms for Estimating Frequency Moments

The problem of approximating $F_{(\alpha)}$ has been very heavily studied in theoretical computer science and databases, since the pioneering work of [2], which studied $\alpha = 0$, 2, and $\alpha > 2$. [11, 21, 26] provided improved algorithms for $0 < \alpha \leq 2$. [22] provided algorithms for $\alpha > 2$ to achieve the lower bounds proved by [35, 5, 37]. [14] suggested using even more space to trade for some speedup in the processing time.

Note that the first moment (i.e., the sum) can be computed easily with a simple counter [31, 13, 2]. This important property was recently captured by the method of Compressed Counting (CC) [27], which was based on the maximally-skewed stable random projections. [27] provided two algorithms, based on the geometric mean and harmonic mean², and proved some important theoretical results:

• The geometric mean algorithm has variance proportional to $O(\Delta)$ in the neighborhood of $\alpha = 1$, where $\Delta = |1 - \alpha|$. This is the first algorithm that captured the intuition that, in the neighborhood of $\alpha = 1$, the moment estimation algorithms should work better and better as $\alpha \to 1$, in a continuous fashion.

Our comments: The geometric mean algorithm, unfortunately, did not provide an adequate mechanism for entropy estimation. As previously discussed, this method leads to an entropy estimation algorithm with complexity $O(1/(\nu^2\Delta))$, which is actually quite intuitive from the definitions of the Tsallis entropy and Rényi entropy. Both entropies contain the $\frac{1}{1-\alpha} = \frac{1}{\Delta}$ terms, meaning that the variance will blow up as $O(1/\Delta^2)$, which cannot be canceled by $O(\Delta)$. Note that [27] did not show that the variance of the harmonic mean algorithm is also proportional to $O(\Delta)$; this paper will provide the proof.

• For fixed $\epsilon$, as $\Delta \to 0$, the sample complexity bound of the geometric mean algorithm is $O(1/\epsilon)$ with all constants specified. This result was a major improvement over the well-known $O(1/\epsilon^2)$ bound [37, 21, 26]. Note that the assumption of fixing $\epsilon$ and letting $\Delta \to 0$ is needed for theoretical convenience in order to derive bounds with no unspecified constants. This study will continue to use this assumption.

Our comments: When $\alpha = 1$, the moment estimation problem is trivial and only requires one simple counter. Therefore, even intuitively, $O(1/\epsilon)$ cannot possibly be the true complexity bound.



2 The Proposed Algorithm 

We consider the relaxed strict-Turnstile model (1). Conceptually, we multiply the data stream vector $A_t \in \mathbb{R}^{D}$ by a random projection matrix $R \in \mathbb{R}^{D \times k}$. The resultant vector $X = A_t \times R \in \mathbb{R}^{k}$ is only of length $k$. More specifically, the entries of the projected vector $X$ are

$$x_j = [A_t \times R]_j = \sum_{i=1}^{D} r_{ij} A_t[i], \qquad j = 1, 2, \ldots, k.$$

The $r_{ij}$'s are random variables generated from the following (non-standard) skewed stable distribution [41]:

$$r_{ij} = \frac{\sin(\alpha v_{ij})}{[\sin v_{ij}]^{1/\alpha}} \left[ \frac{\sin(v_{ij}\Delta)}{w_{ij}} \right]^{\Delta/\alpha}, \qquad \Delta = 1 - \alpha > 0, \qquad (6)$$

where $v_{ij} \sim \mathrm{Uniform}(0, \pi)$ (i.i.d.) and $w_{ij} \sim \mathrm{Exp}(1)$ (i.i.d.), an exponential distribution with mean 1. We use this formulation to avoid numerical problems and simplify the analysis.



Of course, in data stream computations, the matrix $R$ is never fully materialized. The standard procedure in data stream computations is to generate entries of $R$ on-demand [21]. In other words, whenever a stream element $a_t = (i_t, I_t)$ arrives, one updates entries of $X$ as

$$x_j \leftarrow x_j + I_t \, r_{i_t j}, \qquad j = 1, 2, \ldots, k.$$
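The on-demand update can be sketched as follows. This is a minimal illustration, not the paper's implementation: the way a row of $R$ is regenerated from a seed (`seed * 1_000_003 + i`), the function names, and the toy stream are all assumptions made for the example. Each $r_{ij}$ follows (6) and is computed in log-space.

```python
import math
import random

def skewed_stable_row(i, k, alpha, seed=12345):
    """Regenerate the k entries r_{i1}, ..., r_{ik} of row i from a seed, following (6)."""
    delta = 1.0 - alpha
    rng = random.Random(seed * 1_000_003 + i)  # hypothetical seeding scheme
    row = []
    for _ in range(k):
        v = math.pi * (1.0 - rng.random())     # uniform on (0, pi]
        w = rng.expovariate(1.0)               # exponential with mean 1
        log_r = (math.log(math.sin(alpha * v))
                 - math.log(math.sin(v)) / alpha
                 + (delta / alpha) * (math.log(math.sin(v * delta)) - math.log(w)))
        row.append(math.exp(log_r))
    return row

def update(x, i, increment, alpha, seed=12345):
    """Turnstile update x_j <- x_j + I_t * r_{i_t j} for one element (i_t, I_t)."""
    for j, r in enumerate(skewed_stable_row(i, len(x), alpha, seed)):
        x[j] += increment * r

alpha, k = 0.95, 8
x = [0.0] * k
stream = [(3, 5), (7, 2), (3, -1), (9, 4), (7, -2)]  # item 7 is inserted, then fully deleted
for i, inc in stream:
    update(x, i, inc, alpha)

# The same sketch built directly from the final counts A[3] = 4 and A[9] = 4.
direct = [4 * a + 4 * b
          for a, b in zip(skewed_stable_row(3, k, alpha), skewed_stable_row(9, k, alpha))]
print(x)
print(direct)
```

Because the projection is linear, the sketch depends only on the final signal $A_t$, not on the order of insertions and deletions, which is what makes the Turnstile model workable here.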

² The geometric mean and harmonic mean algorithms could be empirically improved using another algorithm based on numerical optimizations [28], which is very difficult for precise theoretical analysis (variances and bounds).
The proposed algorithm is defined as follows:

$$\hat{F}_{(\alpha)} = \frac{1}{\Delta^{\Delta}} \left[ \frac{k}{\sum_{j=1}^{k} x_j^{-\alpha/\Delta}} \right]^{\Delta}. \qquad (7)$$

The following theorem proves that this new estimator is (asymptotically) unbiased with variance proportional to $O(\Delta^2)$. Note that $\Delta^{\Delta} \to 1$ as $\Delta \to 0$.



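Theorem 1

$$E\left( \hat{F}_{(\alpha)} \right) = F_{(\alpha)} \left( 1 + O\left( \frac{\Delta}{k} \right) \right), \qquad (8)$$

$$\mathrm{Var}\left( \hat{F}_{(\alpha)} \right) = \frac{F_{(\alpha)}^2}{k} \Delta^2 \left( 3 - 2\Delta + O\left( \frac{1}{k} \right) \right). \qquad (9)$$

Proof: See Appendix A.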
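A small Monte Carlo experiment can illustrate the mean and variance claimed in (8) and (9). The sketch below is an illustration under assumptions of our own: the sampler draws directly from the representation in (6), the samples are scaled as $x_j = F_{(\alpha)}^{1/\alpha} z_j$, the estimator (7) is evaluated in log-space to avoid overflow of $x_j^{-\alpha/\Delta}$, and the values of `alpha`, `k`, `F`, the seed, and the number of repetitions are arbitrary.

```python
import math
import random

def sample_z(alpha, rng):
    """One draw from the skewed stable representation used in (6)."""
    delta = 1.0 - alpha
    v = math.pi * (1.0 - rng.random())
    w = rng.expovariate(1.0)
    return (math.sin(alpha * v) / math.sin(v) ** (1.0 / alpha)
            * (math.sin(v * delta) / w) ** (delta / alpha))

def estimate_moment(x, alpha):
    """Estimator (7), computed with a log-sum-exp over -(alpha/delta) * log(x_j)."""
    delta = 1.0 - alpha
    logs = [-(alpha / delta) * math.log(xj) for xj in x]
    m = max(logs)
    log_sum = m + math.log(sum(math.exp(t - m) for t in logs))
    return math.exp(delta * (math.log(len(x)) - math.log(delta) - log_sum))

alpha, k, F = 0.95, 50, 1000.0
delta = 1.0 - alpha
rng = random.Random(2024)
reps = 2000
estimates = []
for _ in range(reps):
    x = [F ** (1.0 / alpha) * sample_z(alpha, rng) for _ in range(k)]
    estimates.append(estimate_moment(x, alpha))

mean = sum(estimates) / reps
var = sum((e - mean) ** 2 for e in estimates) / (reps - 1)
predicted_var = F ** 2 * delta ** 2 * (3.0 - 2.0 * delta) / k  # leading term of (9)
print(mean / F, var / predicted_var)
```

Both printed ratios should be close to 1 if the asymptotic formulas are a good description at this sample size; the variance ratio is only expected to match up to the $O(1/k)$ correction and simulation noise.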



In this paper, we only consider $\alpha = 1 - \Delta < 1$. This is because the maximally-skewed stable distributions have good theoretical properties when $\alpha < 1$ [27]: for example, all negative moments exist; see Lemma 2.

2.1 Review Maximally-Skewed Stable Random Projections and Estimators 

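The standard procedure for sampling from skewed stable distributions is based on the Chambers-Mallows-Stuck method [9]. To generate a sample from $S(\alpha, \beta = 1, 1)$, i.e., $\alpha$-stable, maximally-skewed ($\beta = 1$), with unit scale, one first generates an exponential random variable with mean 1, $W \sim \mathrm{Exp}(1)$, and a uniform random variable $U \sim \mathrm{Uniform}\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$; then

$$Z' = \frac{\sin(\alpha(U + \rho))}{[\cos U \cos(\rho\alpha)]^{1/\alpha}} \left[ \frac{\cos(U - \alpha(U + \rho))}{W} \right]^{\frac{1-\alpha}{\alpha}} \sim S(\alpha, \beta = 1, 1), \qquad (10)$$

where $\rho = \frac{\pi}{2}$ when $\alpha < 1$ and $\rho = \frac{\pi}{2}\frac{2-\alpha}{\alpha}$ when $\alpha > 1$.

Note that $\cos\left(\frac{\pi}{2}\alpha\right) \to 0$ as $\alpha \to 1$. For convenience (and avoiding numerical problems), we will use

$$Z = Z' \cos^{1/\alpha}(\rho\alpha) \sim S(\alpha, \beta = 1, \cos(\rho\alpha)).$$

In this study, we will only consider $\alpha = 1 - \Delta < 1$, i.e., $\rho = \frac{\pi}{2}$. After simplification, we obtain

$$Z = \frac{\sin(\alpha V)}{[\sin V]^{1/\alpha}} \left[ \frac{\sin(V\Delta)}{W} \right]^{\Delta/\alpha}, \qquad (11)$$

where $V = \frac{\pi}{2} + U \sim \mathrm{Uniform}(0, \pi)$. This explains (6).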



Lemma 1 shows $\log Z = O(|\Delta \log \Delta|)$, which can be accurately represented using $O(\log 1/\Delta)$ bits. The proof is omitted since it is straightforward.

Lemma 1 For any given $V \neq 0$ and $W \neq 0$, as $\Delta \to 0$,

$$Z = 1 + O(|\Delta \log \Delta|), \quad \text{i.e.,} \quad \log Z = O(|\Delta \log \Delta|).$$



Let $X = A_t \times R$, where entries of $R$ are i.i.d. samples of $S\left(\alpha, \beta = 1, \cos\left(\frac{\pi}{2}\alpha\right)\right)$. Then by properties of stable distributions, entries of $X$ are

$$x_j = [A_t \times R]_j = \sum_{i=1}^{D} r_{ij} A_t[i] \sim S\left( \alpha, \beta = 1, \cos\left( \frac{\pi}{2}\alpha \right) F_{(\alpha)} \right),$$

where $F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}$ as defined in (2).

Therefore, CC boils down to estimating $F_{(\alpha)}$ from $k$ i.i.d. stable samples. [27] provided two statistical estimators, the geometric mean and harmonic mean estimators, which are derived based on the following basic moment formula.

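Lemma 2 If $X \sim S\left(\alpha, \beta = 1, F_{(\alpha)}\cos\left(\frac{\alpha\pi}{2}\right)\right)$, then $X > 0$, and for any $-\infty < \lambda < \alpha < 1$,

$$E\left( X^{\lambda} \right) = F_{(\alpha)}^{\lambda/\alpha} \frac{\Gamma\left(1 - \frac{\lambda}{\alpha}\right)}{\Gamma(1 - \lambda)}.$$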



2.1.1 The Geometric Mean Estimator 

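Assume $x_j$, $j = 1$ to $k$, are i.i.d. samples from $S\left(\alpha, \beta = 1, F_{(\alpha)}\cos\left(\frac{\alpha\pi}{2}\right)\right)$. After simplifying the corresponding expression in [27], we obtain

$$\hat{F}_{(\alpha),gm} = \frac{\prod_{j=1}^{k} x_j^{\alpha/k}}{\left[ \Gamma\left(1 - \frac{1}{k}\right) / \Gamma\left(1 - \frac{\alpha}{k}\right) \right]^{k}}, \qquad (12)$$

which is unbiased and has asymptotic variance

$$\mathrm{Var}\left( \hat{F}_{(\alpha),gm} \right) = \frac{F_{(\alpha)}^2}{k} \frac{\pi^2}{6} \Delta (1 + \alpha) + O\left( \frac{1}{k^2} \right). \qquad (13)$$

As $\alpha \to 1$, the asymptotic variance approaches zero at the rate of only $O(\Delta)$, which is not adequate.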

2.1.2 The Harmonic Mean Estimator 



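After simplifying the corresponding expression in [27], the harmonic mean estimator is

$$\hat{F}_{(\alpha),hm} = \frac{k / \Gamma(1+\alpha)}{\sum_{j=1}^{k} x_j^{-\alpha}} \left( 1 - \frac{1}{k} \left( \frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1 \right) \right), \qquad (14)$$

which is asymptotically unbiased and has variance

$$\mathrm{Var}\left( \hat{F}_{(\alpha),hm} \right) = \frac{F_{(\alpha)}^2}{k} \left( \frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1 \right) + O\left( \frac{1}{k^2} \right). \qquad (15)$$

[27] only graphically showed that the harmonic mean estimator is noticeably better than the geometric mean estimator. We prove the following lemma, which says the variance of the harmonic mean is also proportional to $O(\Delta)$. Thus, the harmonic mean estimator is not adequate for entropy estimation either.

Lemma 3 As $\Delta = 1 - \alpha \to 0$,

$$\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1 = \Delta + \Delta^2 \left( 2 - \frac{\pi^2}{6} \right) + O(\Delta^3). \qquad (16)$$

Proof: See Appendix B.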
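The expansion in Lemma 3 is easy to check numerically with the log-gamma function. The following sketch is only a sanity check; the function names are ours, and `math.lgamma` and `math.expm1` from the Python standard library are used to keep the evaluation accurate for small $\Delta$.

```python
import math

def harmonic_variance_factor(alpha):
    """2 * Gamma(1+alpha)^2 / Gamma(1+2*alpha) - 1, evaluated through log-gamma."""
    log_ratio = math.log(2.0) + 2.0 * math.lgamma(1.0 + alpha) - math.lgamma(1.0 + 2.0 * alpha)
    return math.expm1(log_ratio)

def lemma3_expansion(delta):
    """Second-order expansion Delta + Delta^2 * (2 - pi^2/6) from (16)."""
    return delta + delta ** 2 * (2.0 - math.pi ** 2 / 6.0)

for delta in (1e-1, 1e-2, 1e-3):
    exact = harmonic_variance_factor(1.0 - delta)
    # The last column tends to 1, i.e. the variance factor is O(Delta), not O(Delta^2).
    print(delta, exact, lemma3_expansion(delta), exact / delta)
```

The ratio in the last column staying near 1 is the numerical face of the statement that the harmonic mean estimator only achieves $V = O(\Delta)$.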



2.2 The Distribution Function 

This section provides the distribution function of $Z \sim S\left(\alpha < 1, \beta = 1, \cos\left(\frac{\pi}{2}\alpha\right)\right)$, which will be needed in deriving the proposed estimator (7).
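Lemma 4 Suppose a random variable $Z \sim S\left(\alpha < 1, \beta = 1, \cos\left(\frac{\pi}{2}\alpha\right)\right)$. The cumulative distribution function (CDF) is

$$F_Z(t) = \Pr(Z \leq t) = \frac{1}{\pi} \int_0^{\pi} \exp\left( -t^{-\alpha/\Delta} g(\theta; \Delta) \right) d\theta,$$

where

$$g(\theta; \Delta) = \frac{[\sin(\alpha\theta)]^{\alpha/\Delta}}{[\sin\theta]^{1/\Delta}} \sin(\theta\Delta), \qquad \theta \in (0, \pi).$$

Assume $\Delta = 1 - \alpha < 0.5$; then $g(\theta; \Delta)$ is monotonically increasing in $(0, \pi)$, with

$$\lim_{\theta \to 0+} g(\theta; \Delta) = g(0+; \Delta) = \Delta\alpha^{\alpha/\Delta}.$$

Moreover, $g(\theta; \Delta)$ is a convex function of $\theta$.

Proof: See Appendix C.

Note that $g(0+; \Delta) = \Delta\alpha^{\alpha/\Delta} \approx \Delta e^{-1}$ approaches zero as $\Delta \to 0$. Thus, one might wonder whether the errors would be quite small if we replaced $g(\theta; \Delta)$ by $g(0+; \Delta)$. This conjecture is verified in Figure 1.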



Figure 1: We plot the CDF curves as derived in Lemma 4 for three small values of $\Delta$. As $\Delta \to 0$, the exact CDF (solid curves) is very close to the approximate CDF (dashed curves), which we obtain by replacing the exact $g(\theta; \Delta)$ function in Lemma 4 with the limit $g(0+; \Delta)$.



2.3 The Intuition Behind the Proposed Algorithm 

Basically, we derive the proposed estimator by "guessing." We first derive a maximum likelihood estimator (MLE) for a slightly different distribution based on the intuition from Lemma 4 and Figure 1. Then we verify that this MLE is actually a very good estimator (in terms of both the variances and tail bounds) for the stable distribution we care about.

Here, we consider a random variable $Y$ whose cumulative distribution function (CDF) is

$$F_Y(t) = \Pr(Y \leq t) = \exp\left( -t^{-\alpha/\Delta} \Delta\alpha^{\alpha/\Delta} \right), \qquad t \in [0, \infty). \qquad (17)$$

It is indeed a CDF because it is an increasing function of $t \in [0, \infty)$, $F_Y(0) = 0$, and $F_Y(\infty) = 1$.

Similar to stable random projections, we are interested in estimating $c^{\alpha}$ from $k$ i.i.d. samples $x_j = cY_j$, $j = 1$ to $k$. Statistics theory tells us that the maximum likelihood estimator (MLE) is (asymptotically) optimal. Because the distribution function of $Y_j$ is known, we can actually compute the MLE in this case.
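Theorem 2 Suppose $Y_j$, $j = 1$ to $k$, are i.i.d. samples from a distribution whose CDF is given by (17). Let $x_j = cY_j$, where $c > 0$. Then the maximum likelihood estimator of $c^{\alpha}$ is given by

$$\widehat{c^{\alpha}}_{MLE} = \frac{1}{\Delta^{\Delta}\alpha^{\alpha}} \left[ \frac{k}{\sum_{j=1}^{k} x_j^{-\alpha/\Delta}} \right]^{\Delta}. \qquad (18)$$

Proof: See Appendix D.

Compared with the proposed estimator $\hat{F}_{(\alpha)}$ in (7), the MLE solution has the additional term of $\alpha^{\alpha}$. Note that, while both $\Delta^{\Delta}$ and $\alpha^{\alpha}$ approach 1, $\Delta^{\Delta} \to 1$ considerably slower than $\alpha^{\alpha} \to 1$, because

$$\alpha^{\alpha} = \exp(\alpha \log \alpha) = \exp\left( -\Delta + O(\Delta^2) \right), \qquad \Delta^{\Delta} = \exp(\Delta \log \Delta).$$

For example, when $\Delta = 0.1$, $\Delta^{\Delta} = 0.7943$, $\alpha^{\alpha} = 0.9095$; when $\Delta = 0.01$, $\Delta^{\Delta} = 0.9550$, $\alpha^{\alpha} = 0.9901$.

Therefore, while $\alpha^{\alpha}$ may be considered negligible, it may be preferable to keep $\Delta^{\Delta}$. In fact, when proving that the proposed estimator $\hat{F}_{(\alpha)}$ is (asymptotically) unbiased (see Appendix A), we do need the $\Delta^{\Delta}$ term.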
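The different rates at which $\Delta^{\Delta}$ and $\alpha^{\alpha}$ approach 1 can be tabulated directly. The snippet below simply evaluates the two normalizing terms for a few values of $\Delta$; the function name is ours.

```python
def normalizers(delta):
    """Return (Delta^Delta, alpha^alpha) for alpha = 1 - Delta."""
    alpha = 1.0 - delta
    return delta ** delta, alpha ** alpha

for delta in (0.1, 0.01, 1e-3, 1e-4):
    dd, aa = normalizers(delta)
    print(f"{delta:g}  Delta^Delta = {dd:.4f}  alpha^alpha = {aa:.4f}")
```

Even at $\Delta = 10^{-4}$ the term $\Delta^{\Delta}$ is still visibly below 1, while $\alpha^{\alpha}$ is already indistinguishable from 1 at four decimals, which is the reason for keeping only $\Delta^{\Delta}$ in (7).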



3 The Tail Bounds of the Proposed Estimator 

Theorem 1 has proved that the proposed estimator

$$\hat{F}_{(\alpha)} = \frac{1}{\Delta^{\Delta}} \left[ \frac{k}{\sum_{j=1}^{k} x_j^{-\alpha/\Delta}} \right]^{\Delta}$$

is asymptotically unbiased with variance proportional to $O(\Delta^2)$. Using the standard argument, we know that the sample complexity bound must be $O(V/\epsilon^2) = O(\Delta^2/(\nu^2\Delta^2)) = O(1/\nu^2)$. We are, however, very interested in the precise complexity bounds, not just the orders.

Normally, we would like to present the tail bounds as, e.g., $\Pr\left( \hat{F}_{(\alpha)} \geq (1+\epsilon)F_{(\alpha)} \right) \leq \exp\left( -k\frac{\epsilon^2}{G_R} \right)$, which immediately leads to the statement that:

With probability at least $1 - \delta$, it suffices to use $k \geq \frac{G_R}{\epsilon^2}\log 1/\delta$ to guarantee $\hat{F}_{(\alpha)} \leq (1+\epsilon)F_{(\alpha)}$.

Ideally, we hope $G_R$ will be as small as possible. In fact, in order to achieve a $\nu$-additive algorithm for entropy estimation, we need $\epsilon = \nu\Delta$ (where $\Delta < 10^{-4}$ or even much smaller). Therefore, we really need $G_R = O(\Delta^2)$. In this sense, it is no longer appropriate to treat $G_R$ as a "constant."

Theorem 3 presents the tail bounds for $\hat{F}_{(\alpha)}$.
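Theorem 3 For any $\epsilon > 0$ and $0 < \Delta = 1 - \alpha < 1$, we have the right tail bound for the proposed estimator:

$$\Pr\left( \hat{F}_{(\alpha)} \geq (1+\epsilon)F_{(\alpha)} \right) \leq \exp\left( -k\frac{\epsilon^2}{G_R} \right), \qquad (19)$$

$$\frac{\epsilon^2}{G_R} = -\log\left( \sum_{n=0}^{\infty} \frac{(-t_R)^n}{n!} \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)} \right) - \frac{t_R}{(1+\epsilon)^{1/\Delta}\Delta},$$

where $t_R$ is the solution to

$$\frac{\sum_{n=1}^{\infty} \frac{(-1)^n (t_R)^{n-1}}{(n-1)!} \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}}{\sum_{n=0}^{\infty} \frac{(-t_R)^n}{n!} \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}} + \frac{1}{(1+\epsilon)^{1/\Delta}\Delta} = 0. \qquad (20)$$

For any $0 < \epsilon < 1$ and $0 < \Delta = 1 - \alpha < 1$, we have the left tail bound:

$$\Pr\left( \hat{F}_{(\alpha)} \leq (1-\epsilon)F_{(\alpha)} \right) \leq \exp\left( -k\frac{\epsilon^2}{G_L} \right), \qquad (21)$$

$$\frac{\epsilon^2}{G_L} = -\log\left( \sum_{n=0}^{\infty} \frac{(t_L)^n}{n!} \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)} \right) + \frac{t_L}{(1-\epsilon)^{1/\Delta}\Delta},$$

where $t_L$ is the solution to

$$-\frac{\sum_{n=1}^{\infty} \frac{(t_L)^{n-1}}{(n-1)!} \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}}{\sum_{n=0}^{\infty} \frac{(t_L)^n}{n!} \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}} + \frac{1}{(1-\epsilon)^{1/\Delta}\Delta} = 0. \qquad (22)$$

Proof: See Appendix E.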



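These bounds appear too complicated to give much insight. One may even wonder about the numerical stability of the infinite sums.

First of all, we notice that when $\Delta = 1$ (i.e., $\alpha = 0$), we can compute the tail bounds exactly, as presented in Lemma 5.

Lemma 5 When $\Delta = 1$, i.e., $\alpha = 0$,

$$\frac{\epsilon^2}{G_R} = \log(1+\epsilon) - \frac{\epsilon}{1+\epsilon}, \qquad \epsilon > 0, \qquad (23)$$

$$\frac{\epsilon^2}{G_L} = \log(1-\epsilon) + \frac{\epsilon}{1-\epsilon}, \qquad 0 < \epsilon < 1. \qquad (24)$$

Proof: When $\Delta = 1$ ($\alpha = 0$), we have $\Gamma\left(1+\frac{n}{\Delta}\right) = n!$, $\Gamma\left(1+\frac{n\alpha}{\Delta}\right) = 1$, $\sum_{n=0}^{\infty} t^n = \frac{1}{1-t}$, and $\sum_{n=0}^{\infty} (-t)^n = \frac{1}{1+t}$. Substituting these into (20) and (22) gives $t_R = t_L = \epsilon$. The conclusions follow easily. □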
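The closed forms of Lemma 5 can be turned into a small calculator. The sketch below is an illustration only: the function names and the helper `samples_needed` (which inverts $\exp(-k\epsilon^2/G) \leq \delta$) are ours, and `math.log1p` is used to limit cancellation for small $\epsilon$.

```python
import math

def g_right_delta1(eps):
    """G_R from (23): eps^2 / (log(1+eps) - eps/(1+eps)), for eps > 0."""
    return eps ** 2 / (math.log1p(eps) - eps / (1.0 + eps))

def g_left_delta1(eps):
    """G_L from (24): eps^2 / (log(1-eps) + eps/(1-eps)), for 0 < eps < 1."""
    return eps ** 2 / (math.log1p(-eps) + eps / (1.0 - eps))

def samples_needed(g, eps, fail_prob):
    """Smallest integer k with exp(-k * eps^2 / g) <= fail_prob."""
    return math.ceil(g / eps ** 2 * math.log(1.0 / fail_prob))

for eps in (1e-3, 0.1, 0.5, 0.9):
    print(eps, g_right_delta1(eps), g_left_delta1(eps))

k_right = samples_needed(g_right_delta1(0.5), 0.5, 0.05)
print(k_right)
```

As $\epsilon \to 0$ both constants approach 2, which equals $6 - 4\Delta$ at $\Delta = 1$.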



Next, we re-formulate the tail bounds to facilitate numerical evaluations. Our numerical results show that, when $\Delta$ is small, $G_R \approx (6 \sim 9)\Delta^2$ and $G_L \approx (4 \sim 6)\Delta^2$, for $0 < \nu < 1$. Thus, we indeed have an algorithm for entropy estimation with complexity $O(1/\nu^2)$.



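The tail bounds (19) and (21) contain $\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}$, which can be written as

$$\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)} = \frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n}{\Delta}-n\right)} = \frac{n}{\Delta}\left(\frac{n}{\Delta}-1\right)\left(\frac{n}{\Delta}-2\right)\cdots\left(\frac{n}{\Delta}-n+1\right) = \frac{1}{\Delta^n}\, n(n-\Delta)(n-2\Delta)\cdots(n-(n-1)\Delta).$$

Therefore,

$$\frac{1}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)} = \frac{1}{\Delta^n}\prod_{j=0}^{n-1}\frac{n-j\Delta}{n-j} \leq \frac{1}{\Delta^n}\frac{n^n}{n!} = \frac{1}{\Delta^n}\frac{n^{n-1}}{(n-1)!} \leq \frac{1}{\Delta^n}\frac{e^n}{\sqrt{2\pi n}},$$

according to Stirling's series [15, 8.327]

$$\Gamma(n) = (n-1)! = \sqrt{\frac{2\pi}{n}}\left(\frac{n}{e}\right)^n \left[ 1 + \frac{1}{12n} + \frac{1}{288n^2} - \frac{139}{51840n^3} + \cdots \right].$$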

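Thus, for numerical reasons, we can rewrite (19) and (21) as

$$\frac{\epsilon^2}{G_R} = -\log\left( 1 + \sum_{n=1}^{\infty} (-1)^n \left[ \prod_{j=0}^{n-1} \frac{n-j\Delta}{(n-j)e} \right] \left( \frac{t_R e}{\Delta} \right)^n \right) - \left( \frac{t_R e}{\Delta} \right) \frac{1}{e(1+\epsilon)^{1/\Delta}},$$

$$\frac{\epsilon^2}{G_L} = -\log\left( 1 + \sum_{n=1}^{\infty} \left[ \prod_{j=0}^{n-1} \frac{n-j\Delta}{(n-j)e} \right] \left( \frac{t_L e}{\Delta} \right)^n \right) + \left( \frac{t_L e}{\Delta} \right) \frac{1}{e(1-\epsilon)^{1/\Delta}}.$$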
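The product form of the series coefficients makes the right tail bound straightforward to evaluate numerically. The sketch below is one possible way to do so and rests on assumptions of our own: the series is truncated at 200 terms, the substitution $s = t/\Delta$ is used, and instead of solving (20) exactly the exponent is maximized by a ternary search over $s \in (0, 0.3]$, which stays inside the region $s < 1/e$ where the terms are geometrically bounded. Any $s$ in that range yields a valid bound, so the result is at worst slightly conservative.

```python
import math

def series_coeffs(delta, n_terms=200):
    """a_n = (Delta^n / n!) * Gamma(1 + n/Delta) / Gamma(1 + n*alpha/Delta)
           = prod_{j=0}^{n-1} (n - j*Delta) / (n - j), with a_0 = 1."""
    coeffs = [1.0]
    for n in range(1, n_terms + 1):
        log_a = sum(math.log((n - j * delta) / (n - j)) for j in range(n))
        coeffs.append(math.exp(log_a))
    return coeffs

def right_exponent(s, eps, delta, coeffs):
    """The exponent eps^2/G_R of (19), evaluated at t = s * Delta."""
    total = sum(a * (-s) ** n for n, a in enumerate(coeffs))
    return -math.log(total) - s / math.exp(math.log1p(eps) / delta)

def g_right(eps, delta, s_max=0.3):
    """Maximize the (concave) exponent over s by ternary search; return G_R."""
    coeffs = series_coeffs(delta)
    lo, hi = 0.0, s_max
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if right_exponent(m1, eps, delta, coeffs) < right_exponent(m2, eps, delta, coeffs):
            lo = m1
        else:
            hi = m2
    return eps ** 2 / right_exponent(0.5 * (lo + hi), eps, delta, coeffs)

# Delta = 1 should reproduce the closed form (23).
closed_form = 0.2 ** 2 / (math.log1p(0.2) - 0.2 / 1.2)
print(g_right(0.2, 1.0), closed_form)

# Small Delta and small nu: G_R / Delta^2 should be near 6 - 4*Delta.
delta, nu = 0.01, 0.01
ratio = g_right(nu * delta, delta) / delta ** 2
print(ratio, 6.0 - 4.0 * delta)
```

Restricting the search to $s \leq 0.3$ is a convenience: for $\epsilon = \nu\Delta$ with $0 < \nu < 1$ the maximizer appears to lie well inside this interval, but that is an observation from the leading-order expansion, not something stated in the text.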
The infinite series always converge provided $t_R < \frac{\Delta}{e}$ and $t_L < \frac{\Delta}{e}$. In fact, because the bounds hold for any $t > 0$ (not necessarily the optimal values, $t_R$ and $t_L$), we know $\frac{G_R}{\Delta^2} = O(1)$ and $\frac{G_L}{\Delta^2} = O(1)$ if using (e.g.,) $t = 0.5\frac{\Delta}{e}$. In other words, $G_R = O(\Delta^2)$ and $G_L = O(\Delta^2)$, as desired. We state this as a Lemma.

Lemma 6 The tail bound constants in (19) and (21) satisfy

$$\frac{G_R}{\Delta^2} = O(1), \qquad \frac{G_L}{\Delta^2} = O(1), \qquad (\epsilon = \nu\Delta).$$

In other words,

$$G_R = O(\Delta^2), \qquad G_L = O(\Delta^2).$$

Therefore, to estimate $F_{(\alpha)}$ within a $(1 \pm \nu\Delta)$ factor, it suffices to let the sample size $k = O(1/\nu^2)$, using the proposed estimator $\hat{F}_{(\alpha)}$.

Figure 2 presents the values in terms of $\frac{G_R}{\Delta^2}$ and $\frac{G_L}{\Delta^2}$ for $0 < \nu < 1$ and three small values of $\Delta$, together with the closed-form expressions for $\Delta = 1$ as obtained in Lemma 5. The values are pleasantly small. Thus, at least numerically, we can say, for example, $\max(G_R, G_L) \leq 9\Delta^2$ when $\Delta$ is small.

In other words, with probability at least $1 - \delta$, using the proposed estimator, one can achieve $|\hat{F}_{(\alpha)} - F_{(\alpha)}| \leq (\nu\Delta)F_{(\alpha)}$ by using $k \geq 9\frac{\log 2/\delta}{\nu^2}$ samples. And we know the constant 9 could be replaced by 6 if $\nu$ is small.

Figure 2: Numerical values of $G_R$ and $G_L$ in the tail bounds (19) and (21), for three small values of $\Delta$, together with the closed-form expressions for $\Delta = 1$ as obtained in Lemma 5. Because $G_R = O(\Delta^2)$ and $G_L = O(\Delta^2)$, we present the results in terms of $\frac{G_R}{\Delta^2}$ and $\frac{G_L}{\Delta^2}$ (horizontal axis: $\nu$). Note that as $\nu \to 0$, both $\frac{G_R}{\Delta^2}$ and $\frac{G_L}{\Delta^2}$ approach $6 - 4\Delta$, as proved in Lemma 7. Also, note that the curves for the three small values of $\Delta$ largely overlap.



Whenever possible, analytical expressions are always more desirable. In fact, when $\nu \to 0$, we can actually obtain the analytical expressions for $G_R$ and $G_L$.

Lemma 7 As $\nu \to 0$,

$$\frac{G_R}{\Delta^2} \to 6 - 4\Delta, \qquad \frac{G_L}{\Delta^2} \to 6 - 4\Delta. \qquad (25)$$

Proof: See the Appendix.



4 The Connection to the Sample Minimum Estimator 

In the previous (unpublished) work [29], we proposed the sample minimum estimator, which allowed us to prove a much improved sample complexity bound than that in [27]. Interestingly, the proposed estimator in this paper actually converges to the sample minimum estimator, denoted by $\hat{F}_{(\alpha),min}$:

$$\hat{F}_{(\alpha)} = \frac{1}{\Delta^{\Delta}} \left[ \frac{k}{\sum_{j=1}^{k} x_j^{-\alpha/\Delta}} \right]^{\Delta} \to \hat{F}_{(\alpha),min} = \min\left\{ x_j^{\alpha},\ j = 1, 2, \ldots, k \right\}. \qquad (26)$$

This fact is quite intuitive. As $\Delta \to 0$, the smallest of the $x_j$'s is amplified the most by $x_j^{-\alpha/\Delta}$. This is analogous to the well-known fact that the $l_p$ norm approaches the $l_\infty$ norm (which is the maximum element of the vector), as $p \to \infty$.
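The convergence in (26) can be observed on a fixed set of numbers. In the sketch below the list `x` is a hypothetical set of projected samples chosen only for illustration, and the estimator (7) is again evaluated in log-space so that very small $\Delta$ does not cause overflow.

```python
import math

def proposed_estimate(x, delta):
    """Estimator (7) with alpha = 1 - delta, via log-sum-exp."""
    alpha = 1.0 - delta
    logs = [-(alpha / delta) * math.log(xj) for xj in x]
    m = max(logs)
    log_sum = m + math.log(sum(math.exp(t - m) for t in logs))
    return math.exp(delta * (math.log(len(x)) - math.log(delta) - log_sum))

x = [1.7, 2.4, 1.2, 3.1, 1.9]  # hypothetical projected samples
for delta in (1e-1, 1e-2, 1e-4, 1e-6):
    alpha = 1.0 - delta
    print(delta, proposed_estimate(x, delta), min(x) ** alpha)
```

The remaining gap at small $\Delta$ is the factor $(k/\Delta)^{\Delta}$, which tends to 1 only at the slow rate of $\Delta \log(1/\Delta)$.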



In [29], we proved the following (closed-form) sample complexity bound for $\hat{F}_{(\alpha),min}$.
Theorem 4 [29] As $\Delta = 1 - \alpha \to 0+$, for any fixed $\epsilon > 0$,



> (1 + e)i^(„)^ < exp ( fclog^ 



A 



A 



log(l + e) AlogA + log(l + e) 



0(A2) 



(27) 



Basically, in terms of $\epsilon = \nu\Delta$, Theorem 4 is applicable when $\nu$ is large ($\nu \gg 1$) and $\Delta$ is small. A simulation study in [29] demonstrated that the bound in Theorem 4 can be very sharp.



5 Conclusion 

Real-world data are often dynamic and can be modeled as data streams. Measuring summary statistics of data streams such as the Shannon entropy has become an important task in many applications, for example, detecting anomaly events in large-scale networks. One line of active research is to approximate the Shannon entropy using the $\alpha$th frequency moments of the stream with $\alpha$ very close to 1 (e.g., $\Delta = 1 - \alpha < 10^{-4}$ or even much smaller).

Efficiently approximating the $\alpha$th frequency moments of data streams has been very heavily studied in theoretical computer science and databases. When $0 < \alpha \leq 2$, it is well-known that efficient $O(1/\epsilon^2)$-space algorithms exist, for example, symmetric stable random projections [21, 26], which however are impractical for estimating Shannon entropy using $\alpha$ extremely close to 1. Recently, [27] provided an algorithm to achieve the $O(1/\epsilon)$ bound in the neighborhood of $\alpha = 1$, based on the idea of maximally-skewed stable random projections (also called Compressed Counting (CC)). The algorithms provided in [27], however, are still impractical.

In this paper, we provide a truly practical algorithm for entropy estimation. We prove that its variance is proportional to $O(\Delta^2)$ whereas previous algorithms for CC developed in [27] have variances proportional only to $O(\Delta)$. This new algorithm leads to an $O(1/\nu^2)$ algorithm for entropy estimation to achieve $\nu$-additive accuracy, while previous algorithms must use $O(1/(\nu^2\Delta^2))$ samples [21, 26], or $O(1/(\nu^2\Delta))$ samples [27]. Note that because $\Delta$ is so small, it is no longer appropriate to treat it as "constant."

We also analyze the precise sample complexity bound of the proposed new estimator, both numerically (for general $0 < \nu < 1$) and analytically (for small $\nu$), to demonstrate that the sample complexity bound of the new estimator is free of large constants. This further confirms that our proposed new estimator is practical.
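A Proof of Theorem 1

As defined in (7), the proposed estimator is

$$\hat{F}_{(\alpha)} = \frac{1}{\Delta^{\Delta}} \left[ \frac{k}{\sum_{j=1}^{k} x_j^{-\alpha/\Delta}} \right]^{\Delta},$$

where $x_j \sim S\left(\alpha, \beta = 1, F_{(\alpha)}\cos\left(\frac{\alpha\pi}{2}\right)\right)$, i.i.d. Denote $J = \frac{\Delta}{k}\sum_{j=1}^{k} x_j^{-\alpha/\Delta}$, so that $\hat{F}_{(\alpha)} = J^{-\Delta}$, and let $\bar{J} = E(J)$. According to Lemma 2, and using $\frac{\alpha}{\Delta} = \frac{1}{\Delta} - 1$,

$$\bar{J} = E\left( \Delta x_j^{-\alpha/\Delta} \right) = \Delta F_{(\alpha)}^{-1/\Delta} \frac{\Gamma\left(1+\frac{1}{\Delta}\right)}{\Gamma\left(1+\frac{\alpha}{\Delta}\right)} = \Delta F_{(\alpha)}^{-1/\Delta} \frac{1}{\Delta} = F_{(\alpha)}^{-1/\Delta},$$

$$\mathrm{Var}(J) = \frac{1}{k}\mathrm{Var}\left( \Delta x_j^{-\alpha/\Delta} \right) = \frac{1}{k} F_{(\alpha)}^{-2/\Delta} \left[ \Delta^2 \frac{2}{\Delta}\left( \frac{2}{\Delta} - 1 \right) - 1 \right] = \frac{1}{k} F_{(\alpha)}^{-2/\Delta} (3 - 2\Delta) = \frac{\bar{J}^2}{k}(3 - 2\Delta).$$

A bit more algebra can show that the higher-order central moments of $J$ are of lower order, $E(J - \bar{J})^3 = O\left(\bar{J}^3/k^2\right)$ and $E(J - \bar{J})^4 = O\left(\bar{J}^4/k^2\right)$.

Recall $\hat{F}_{(\alpha)} = J^{-\Delta}$. We will basically proceed by using the "delta" method popular in statistics. We need to be a bit careful here as $\Delta$ is small. Just to make sure the resultant higher-order terms are indeed negligible, we carry out the algebra.

By the Taylor expansion about $\bar{J}$, we obtain

$$\hat{F}_{(\alpha)} = \bar{J}^{-\Delta} - (J - \bar{J})\left( \Delta\bar{J}^{-\Delta-1} \right) + \frac{(J - \bar{J})^2}{2}\Delta(\Delta+1)\bar{J}^{-\Delta-2} + \cdots$$

Taking expectations on both sides yields

$$E\left( \hat{F}_{(\alpha)} \right) = \bar{J}^{-\Delta} + \frac{\mathrm{Var}(J)}{2}\Delta(\Delta+1)\bar{J}^{-\Delta-2} + \cdots = \bar{J}^{-\Delta} + \frac{1}{2k}\bar{J}^2(3-2\Delta)\Delta(\Delta+1)\bar{J}^{-\Delta-2} + \cdots = F_{(\alpha)}\left( 1 + O\left( \frac{\Delta}{k} \right) \right).$$

Evaluating the higher-order moments yields

$$\mathrm{Var}\left( \hat{F}_{(\alpha)} \right) = E\left( -(J - \bar{J})\left( \Delta\bar{J}^{-\Delta-1} \right) + \frac{(J - \bar{J})^2}{2}\Delta(\Delta+1)\bar{J}^{-\Delta-2} + \cdots \right)^2 + \cdots$$

$$= E(J - \bar{J})^2 \Delta^2 \bar{J}^{-2\Delta-2} - E(J - \bar{J})^3 \Delta^2(\Delta+1)\bar{J}^{-2\Delta-3} + \cdots = \frac{F_{(\alpha)}^2}{k}\Delta^2\left( 3 - 2\Delta + O\left( \frac{1}{k} \right) \right).$$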



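B Proof of Lemma 3

The task is to show that, as $\Delta = 1 - \alpha \to 0$,

$$\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1 = \Delta + \Delta^2\left( 2 - \frac{\pi^2}{6} \right) + O(\Delta^3).$$

Using properties of Gamma functions, for example, $\Gamma(1+x) = x\Gamma(x)$, we obtain

$$\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} = \frac{2\alpha^2\Gamma^2(\alpha)}{2\alpha\Gamma(2\alpha)} = \frac{\alpha\Gamma^2(\alpha)}{\Gamma(2\alpha)} = \frac{\alpha\Gamma^2(1-\Delta)}{\Gamma(2-2\Delta)} = \frac{\alpha(-\Delta)^2\Gamma^2(-\Delta)}{(1-2\Delta)(-2\Delta)\Gamma(-2\Delta)} = \frac{(1-\Delta)\Delta}{(-2)(1-2\Delta)}\frac{\Gamma^2(-\Delta)}{\Gamma(-2\Delta)}.$$

Using the infinite product representation of the Gamma function [15, 8.322], $\Gamma(z) = \frac{e^{-\gamma_e z}}{z}\prod_{n=1}^{\infty}\left( 1 + \frac{z}{n} \right)^{-1} e^{z/n}$ ($\gamma_e$ is Euler's constant), we obtain

$$\frac{\Gamma^2(-\Delta)}{\Gamma(-2\Delta)} = \frac{-2}{\Delta}\prod_{n=1}^{\infty}\left( 1 - \frac{\Delta}{n} \right)^{-2}\left( 1 - \frac{2\Delta}{n} \right) = \frac{-2}{\Delta}\prod_{n=1}^{\infty}\left( 1 + \frac{2\Delta}{n} + \frac{3\Delta^2}{n^2} + \cdots \right)\left( 1 - \frac{2\Delta}{n} \right)$$

$$= \frac{-2}{\Delta}\exp\left( \sum_{n=1}^{\infty}\log\left( 1 - \frac{\Delta^2}{n^2} + O(\Delta^3) \right) \right) = \frac{-2}{\Delta}\left( 1 - \frac{\pi^2}{6}\Delta^2 + O(\Delta^3) \right).$$

Therefore,

$$\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1 = \frac{1-\Delta}{1-2\Delta}\left( 1 - \frac{\pi^2}{6}\Delta^2 + O(\Delta^3) \right) - 1$$

$$= (1-\Delta)\left( 1 + 2\Delta + 4\Delta^2 + O(\Delta^3) \right)\left( 1 - \frac{\pi^2}{6}\Delta^2 + O(\Delta^3) \right) - 1$$

$$= \left( 1 + \Delta + 2\Delta^2 + O(\Delta^3) \right)\left( 1 - \frac{\pi^2}{6}\Delta^2 + O(\Delta^3) \right) - 1$$

$$= \Delta + \Delta^2\left( 2 - \frac{\pi^2}{6} \right) + O(\Delta^3).$$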
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C Proof of Lemma |4] 



Suppose a random variable Z ~ 5* (a < = l,cos (fa)). We can show that the cumulative distribution 
function is 



1 r 



Recall Z 



Fz{t) = Pr{Z<t) = - I cxp(-J^;^i^^ll^sin(M)L0, (A = l-a). 

. V is uniform in [0, tt] and W is exponential with mean 1. Therefore, 



[siny]i 



i(yA) 



w 



Pr (Z > t) =Pr 



sin (aF) 



[sin V 



I/O 



sin (FA) 



W 



> t 



V "W^[siny]i/^ 7 

.E |^Pr|^iy<lM^^^::ll!^sin(l^A) 



= 1 — E cxp — 



V 



t"'/^ [sin V] 



^ sin (FA) 



-1 - - [ exp I 

TTJo V [sin 61] 



sin(a0)]"/^ . \ 



Fore e (0,7r), let 



It is easy to show that, as — > 0- 



//, AN [sin (q;6')1"^^ . ,^ ^ , 
[smffj ' 



lim g{0,A)^\im i^^^lM^ sin (0A) 



9^0+ 



lim 



sin(a6l)y/'^ sin(6'A) 
sin (aO) 



0^0+ \ smty 
a 

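The proof of the monotonicity of $g(\theta; \Delta)$ is omitted, because it can be inferred from the proof of the convexity.

To show that $g(\theta; \Delta)$ is a convex function of $\theta$, it suffices to show that it is log-convex. Since

$$g(\theta; \Delta) = \sin(\theta\Delta)\frac{[\sin(\alpha\theta)]^{\alpha/\Delta}}{[\sin\theta]^{1/\Delta}} = \frac{\sin(\theta\Delta)}{\sin(\alpha\theta)}\left[\frac{\sin(\alpha\theta)}{\sin\theta}\right]^{1/\Delta},$$

it suffices to show that both $\frac{\sin(\theta\Delta)}{\sin(\alpha\theta)}$ and $\left[\frac{\sin(\alpha\theta)}{\sin\theta}\right]^{1/\Delta}$ are log-convex.

For the first factor,

$$\frac{\partial\left[\log\sin(\theta\Delta) - \log\sin(\alpha\theta)\right]}{\partial\theta} = \Delta\frac{\cos(\theta\Delta)}{\sin(\theta\Delta)} - \alpha\frac{\cos(\alpha\theta)}{\sin(\alpha\theta)},$$

$$\frac{\partial^2\left[\log\sin(\theta\Delta) - \log\sin(\alpha\theta)\right]}{\partial\theta^2} = -\frac{\Delta^2}{\sin^2(\theta\Delta)} + \frac{\alpha^2}{\sin^2(\alpha\theta)} = \left(\frac{\alpha}{\sin(\alpha\theta)} - \frac{\Delta}{\sin(\theta\Delta)}\right)\left(\frac{\alpha}{\sin(\alpha\theta)} + \frac{\Delta}{\sin(\theta\Delta)}\right).$$

The second factor is positive on $(0, \pi)$, so the sign is that of $\alpha\sin(\theta\Delta) - \Delta\sin(\alpha\theta)$. This quantity vanishes at $\theta = 0$, and

$$\frac{\partial\left[\alpha\sin(\theta\Delta) - \Delta\sin(\alpha\theta)\right]}{\partial\theta} = \alpha\Delta\left(\cos(\theta\Delta) - \cos(\alpha\theta)\right) > 0 \qquad (\text{because } \Delta < 0.5).$$

Therefore, $\alpha\sin(\theta\Delta) - \Delta\sin(\alpha\theta) > 0$ and $\frac{\sin(\theta\Delta)}{\sin(\alpha\theta)}$ is log-convex.

Similarly, for the second factor,

$$\frac{\partial\left[\log\sin(\alpha\theta) - \log\sin\theta\right]}{\partial\theta} = \alpha\frac{\cos(\alpha\theta)}{\sin(\alpha\theta)} - \frac{\cos\theta}{\sin\theta},$$

$$\frac{\partial^2\left[\log\sin(\alpha\theta) - \log\sin\theta\right]}{\partial\theta^2} = -\frac{\alpha^2}{\sin^2(\alpha\theta)} + \frac{1}{\sin^2\theta} = \left(\frac{1}{\sin\theta} - \frac{\alpha}{\sin(\alpha\theta)}\right)\left(\frac{1}{\sin\theta} + \frac{\alpha}{\sin(\alpha\theta)}\right),$$

whose sign is that of $\sin(\alpha\theta) - \alpha\sin\theta$. This quantity vanishes at $\theta = 0$, and

$$\frac{\partial\left[\sin(\alpha\theta) - \alpha\sin\theta\right]}{\partial\theta} = \alpha\left(\cos(\alpha\theta) - \cos\theta\right) > 0 \qquad (\text{because } \alpha = 1 - \Delta > 0.5).$$

Hence $\frac{\sin(\alpha\theta)}{\sin\theta}$ is log-convex, and so is its power $\left[\frac{\sin(\alpha\theta)}{\sin\theta}\right]^{1/\Delta}$.

Therefore, we have proved the convexity of $g(\theta; \Delta)$.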
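The monotonicity, convexity, and limit of $g(\theta; \Delta)$ can also be checked numerically on a grid. This is only a sanity check under illustrative choices (the grid range, the step count, and the function names are assumptions); it does not replace the proof.

```python
import math


def g(theta, delta):
    """g(theta; Delta) = sin(theta*D)/sin(a*theta) * (sin(a*theta)/sin(theta))^(1/D)."""
    alpha = 1.0 - delta
    return (math.sin(theta * delta) / math.sin(alpha * theta)
            * (math.sin(alpha * theta) / math.sin(theta)) ** (1.0 / delta))


def check_shape(delta, lo=0.01, hi=3.0, n=300):
    """Return (increasing, convex) for g evaluated on a uniform grid in [lo, hi]."""
    h = (hi - lo) / n
    vals = [g(lo + i * h, delta) for i in range(n + 1)]
    increasing = all(b > a for a, b in zip(vals, vals[1:]))
    # Positive second differences indicate convexity on the grid.
    convex = all(vals[i - 1] - 2.0 * vals[i] + vals[i + 1] > 0.0
                 for i in range(1, n))
    return increasing, convex


if __name__ == "__main__":
    for delta in (0.2, 0.05):
        alpha = 1.0 - delta
        print(delta, check_shape(delta), g(1e-4, delta), delta * alpha ** (alpha / delta))
```

The last two printed values compare $g$ near $\theta = 0$ with the limit $\Delta\,\alpha^{\alpha/\Delta}$.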



D Proof of Theorem H 

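Given $k$ i.i.d. samples $x_j = cY_j$, the task is to estimate $c^{\alpha}$ using MLE. The CDF of $Y_j$ is given by

$$F_Y(t) = \Pr(Y \le t) = \exp\left(-t^{-\alpha/\Delta}\,\Delta\,\alpha^{\alpha/\Delta}\right), \qquad t \in [0, \infty).$$

By taking derivatives, the density function of $x_j$ is given by

$$f_X(t) = \frac{1}{c}f_Y(t/c) = c^{\alpha/\Delta}F_Y(t/c)\,\alpha^{1/\Delta}\,t^{-1/\Delta},$$

because

$$f_Y(t) = F_Y(t)\,(-\Delta)(-\alpha/\Delta)\,\alpha^{\alpha/\Delta}\,t^{-\alpha/\Delta-1} = F_Y(t)\,\alpha^{1/\Delta}\,t^{-1/\Delta}.$$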

Solving the MLE equation. 



we obtain 



AAc 



^fc -a/A 



A 



E Proof of Theorem m 



From the previous results, we know 



A 



X 7 ''^ S 



(a,P = l,cos (|a) , 



^l^J-^(a) r(l-A) 



— na /A 



^^-„/A^ r(i + ^) 
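We first study the right tail bound. For any $t > 0$,

$$\begin{aligned}
\Pr\left(\hat{F}_{(\alpha)} \ge (1+\epsilon)F_{(\alpha)}\right)
&= \Pr\left(\frac{1}{\Delta^{\Delta}}\left[\frac{k}{\sum_{j=1}^{k}x_j^{-\alpha/\Delta}}\right]^{\Delta} \ge (1+\epsilon)F_{(\alpha)}\right) \\
&= \Pr\left(\sum_{j=1}^{k}x_j^{-\alpha/\Delta} \le \frac{k}{(1+\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right) \\
&= \Pr\left(-t\sum_{j=1}^{k}x_j^{-\alpha/\Delta} \ge -t\,\frac{k}{(1+\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right) \\
&\le E\left[\exp\left(-t\sum_{j=1}^{k}x_j^{-\alpha/\Delta}\right)\right]\exp\left(t\,\frac{k}{(1+\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right) \\
&= \left(E\left[\exp\left(-t\,x_j^{-\alpha/\Delta}\right)\right]\right)^{k}\exp\left(t\,\frac{k}{(1+\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right).
\end{aligned}$$

Because $t > 0$ is arbitrary, we can replace $t$ by $t\,F_{(\alpha)}^{1/\Delta}$. Using the moments $E\left(x_j^{-n\alpha/\Delta}\right) = F_{(\alpha)}^{-n/\Delta}\,\Gamma\left(1+\frac{n}{\Delta}\right)/\Gamma\left(1+\frac{n\alpha}{\Delta}\right)$, the bound becomes

$$\begin{aligned}
\Pr\left(\hat{F}_{(\alpha)} \ge (1+\epsilon)F_{(\alpha)}\right)
&\le \left(\sum_{n=0}^{\infty}\frac{(-t)^n}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}\right)^{k}\exp\left(t\,\frac{k}{(1+\epsilon)^{1/\Delta}\Delta}\right) \\
&= \exp\left(-k\left(-\log\left(\sum_{n=0}^{\infty}\frac{(-t)^n}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}\right) - \frac{t}{(1+\epsilon)^{1/\Delta}\Delta}\right)\right).
\end{aligned}$$

We can choose the optimal $t$ to minimize this upper bound. Thus,

$$\Pr\left(\hat{F}_{(\alpha)} \ge (1+\epsilon)F_{(\alpha)}\right) \le \exp\left(-k\frac{\epsilon^2}{G_R}\right),$$

$$\frac{\epsilon^2}{G_R} = -\log\left(\sum_{n=0}^{\infty}\frac{(-t_R)^n}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}\right) - \frac{t_R}{(1+\epsilon)^{1/\Delta}\Delta},$$

where $t_R$ is the solution to

$$\frac{\sum_{n=1}^{\infty}\frac{(-1)^n\,t_R^{\,n-1}}{(n-1)!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}}{\sum_{n=0}^{\infty}\frac{(-t_R)^n}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}} + \frac{1}{(1+\epsilon)^{1/\Delta}\Delta} = 0.$$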
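Now, we look into the left tail bound. For any $t > 0$,

$$\begin{aligned}
\Pr\left(\hat{F}_{(\alpha)} \le (1-\epsilon)F_{(\alpha)}\right)
&= \Pr\left(\frac{1}{\Delta^{\Delta}}\left[\frac{k}{\sum_{j=1}^{k}x_j^{-\alpha/\Delta}}\right]^{\Delta} \le (1-\epsilon)F_{(\alpha)}\right) \\
&= \Pr\left(t\sum_{j=1}^{k}x_j^{-\alpha/\Delta} \ge t\,\frac{k}{(1-\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right) \\
&\le E\left[\exp\left(t\sum_{j=1}^{k}x_j^{-\alpha/\Delta}\right)\right]\exp\left(-t\,\frac{k}{(1-\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right) \\
&= \left(E\left[\exp\left(t\,x_j^{-\alpha/\Delta}\right)\right]\right)^{k}\exp\left(-t\,\frac{k}{(1-\epsilon)^{1/\Delta}\,\Delta\,F_{(\alpha)}^{1/\Delta}}\right).
\end{aligned}$$

Replacing $t$ by $t\,F_{(\alpha)}^{1/\Delta}$ and using the moments $E\left(x_j^{-n\alpha/\Delta}\right) = F_{(\alpha)}^{-n/\Delta}\,\Gamma\left(1+\frac{n}{\Delta}\right)/\Gamma\left(1+\frac{n\alpha}{\Delta}\right)$, the bound becomes

$$\Pr\left(\hat{F}_{(\alpha)} \le (1-\epsilon)F_{(\alpha)}\right) \le \exp\left(-k\left(-\log\left(\sum_{n=0}^{\infty}\frac{t^n}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}\right) + \frac{t}{(1-\epsilon)^{1/\Delta}\Delta}\right)\right).$$

Again, we can choose the optimal $t = t_L$ to minimize this upper bound. Thus,

$$\Pr\left(\hat{F}_{(\alpha)} \le (1-\epsilon)F_{(\alpha)}\right) \le \exp\left(-k\frac{\epsilon^2}{G_L}\right),$$

$$\frac{\epsilon^2}{G_L} = -\log\left(\sum_{n=0}^{\infty}\frac{t_L^{\,n}}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}\right) + \frac{t_L}{(1-\epsilon)^{1/\Delta}\Delta},$$

where $t_L$ is the solution to

$$-\frac{\sum_{n=1}^{\infty}\frac{t_L^{\,n-1}}{(n-1)!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}}{\sum_{n=0}^{\infty}\frac{t_L^{\,n}}{n!}\frac{\Gamma\left(1+\frac{n}{\Delta}\right)}{\Gamma\left(1+\frac{n\alpha}{\Delta}\right)}} + \frac{1}{(1-\epsilon)^{1/\Delta}\Delta} = 0.$$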



F Proof of Lemma 7



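We have derived $G_R$ and $G_L$ in Theorem 3. The task of this lemma is to show that, as $\nu \to 0$ (recall $\epsilon = \nu\Delta$),

$$\frac{G_R}{\Delta^2} \to 6 - 4\Delta, \qquad \frac{G_L}{\Delta^2} \to 6 - 4\Delta.$$

To proceed with the proof, we first assume that, as $\nu \to 0$, we have

$$\frac{t_R}{\Delta} = s_R = O(\nu), \qquad \frac{t_L}{\Delta} = s_L = O(\nu),$$

which can be later verified. With this assumption, and using $\Gamma\left(1+\frac{n}{\Delta}\right)/\Gamma\left(1+\frac{n\alpha}{\Delta}\right) = \prod_{j=0}^{n-1}\left(\frac{n}{\Delta} - j\right)$, we can expand

$$\begin{aligned}
\frac{\Delta^2}{G_L}
&= \frac{1}{\nu^2}\left(-\log\left(1+\sum_{n=1}^{\infty}\frac{s_L^{\,n}}{n!}\prod_{j=0}^{n-1}(n - j\Delta)\right) + \frac{s_L}{(1-\nu\Delta)^{1/\Delta}}\right) \\
&= \frac{1}{\nu^2}\left(-\log\left(1 + s_L + s_L^2(2-\Delta) + \dots\right)\right) + \frac{s_L}{\nu^2(1-\nu\Delta)^{1/\Delta}} \\
&= \frac{1}{\nu^2}\left(-\left(s_L + s_L^2(2-\Delta) + \dots\right) + \frac{1}{2}\left(s_L + s_L^2(2-\Delta) + \dots\right)^2 + \dots\right) + \frac{s_L}{\nu^2(1-\nu\Delta)^{1/\Delta}} \\
&= \frac{1}{\nu^2}\left(-s_L - s_L^2(2-\Delta) + \frac{s_L^2}{2} + \dots\right) + \frac{s_L}{\nu^2(1-\nu\Delta)^{1/\Delta}} \\
&= \frac{1}{\nu^2}\left(-s_L - \frac{3-2\Delta}{2}s_L^2 + \dots\right) + \frac{s_L}{\nu^2(1-\nu\Delta)^{1/\Delta}}.
\end{aligned}$$

Setting the first derivative (with respect to $s_L$) to zero,

$$-1 - (3-2\Delta)s_L + \dots + \frac{1}{(1-\nu\Delta)^{1/\Delta}} = 0,$$

we obtain

$$s_L = \frac{1}{3-2\Delta}\left(\frac{1}{(1-\nu\Delta)^{1/\Delta}} - 1\right) + \dots = \frac{1}{3-2\Delta}\left(\nu + \frac{\nu^2}{2}(1+\Delta) + \dots\right),$$

which verifies that $s_L$ is indeed on the order of $\nu$. Therefore,

$$\begin{aligned}
\frac{\Delta^2}{G_L}
&= \frac{1}{\nu^2}\left(-s_L - \frac{3-2\Delta}{2}s_L^2 + \dots\right) + \frac{s_L}{\nu^2(1-\nu\Delta)^{1/\Delta}} \\
&= \frac{1}{\nu^2}\left(s_L\left(\nu + \frac{\nu^2}{2}(1+\Delta) + \dots\right) - \frac{3-2\Delta}{2}s_L^2 + \dots\right) \\
&= \frac{1}{\nu^2}\left(\frac{1}{3-2\Delta}\left(\nu + \frac{\nu^2}{2}(1+\Delta)\right)^2 - \frac{1}{2(3-2\Delta)}\left(\nu + \frac{\nu^2}{2}(1+\Delta)\right)^2 + \dots\right) \\
&= \frac{1}{6-4\Delta} + O(\nu).
\end{aligned}$$

Thus, we have proved that $\frac{G_L}{\Delta^2} \to 6 - 4\Delta$ as $\nu \to 0$. A similar procedure can also prove $\frac{G_R}{\Delta^2} \to 6 - 4\Delta$.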
G Experiments



This section demonstrates that the proposed estimator $\hat{F}_{(\alpha)}$ in (7) for Compressed Counting (CC) is a truly practical
algorithm, while the previously proposed geometric mean algorithm [27] for CC is inadequate for entropy
estimation. We also demonstrate that algorithms based on symmetric stable random projections ISTl l27l are not 
suitable for entropy estimation. 

G.1 Data

Since the estimation accuracy is what we are interested in, we can simply use static data instead of real data
streams. This is because the projected data vector $X = R^T A_t$ is the same at the end of the stream (i.e., time t),
regardless of whether it is computed at once (i.e., static) or incrementally (i.e., dynamic).
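The equivalence of the static and the incremental computation follows from the linearity of the projection. The sketch below illustrates it on a toy stream. The Gaussian projection matrix is a placeholder chosen only for brevity (CC itself uses maximally skewed stable entries), and all names (`project_static`, `project_stream`, `demo`) are illustrative.

```python
import random


def project_static(a, r):
    """x_j = sum_i r[i][j] * a[i], computed once from the final data vector a."""
    k = len(r[0])
    return [sum(r[i][j] * a[i] for i in range(len(a))) for j in range(k)]


def project_stream(updates, r):
    """Same projection, maintained incrementally over (index, increment) updates."""
    k = len(r[0])
    x = [0.0] * k
    for i, inc in updates:
        for j in range(k):
            x[j] += r[i][j] * inc
    return x


def demo(dim=20, k=4, n_updates=200, seed=0):
    """Build a toy stream and return (static projection, streamed projection)."""
    rng = random.Random(seed)
    # Placeholder projection matrix; not the skewed stable matrix used by CC.
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(dim)]
    updates = [(rng.randrange(dim), rng.choice([1, 2, 3])) for _ in range(n_updates)]
    a = [0.0] * dim
    for i, inc in updates:
        a[i] += inc   # final state of the data vector A_t
    return project_static(a, r), project_stream(updates, r)


if __name__ == "__main__":
    static_x, stream_x = demo()
    print(max(abs(u - v) for u, v in zip(static_x, stream_x)))
```

The two results agree up to floating-point rounding, which is why static vectors suffice for studying estimation accuracy.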

Eight English words are selected from a chunk of Web crawl data. The words are selected fairly randomly, 
although we make sure they cover a whole range of data sparsity, from function words (e.g., "A"), to common 
words (e.g., "FRIDAY") to rare words (e.g., "TWIST"). Thus, as summarized in Table 1, our data set consists of
8 vectors and the entries are the numbers of word occurrences in each document.



Table 1: The data set consists of 8 English words selected from a corpus of Web pages, forming 8 vectors whose values are
the word occurrences. The table lists their fractions of non-zeros (sparsity) and the Shannon entropies (H).

Word        Sparsity    Entropy H
TWIST       0.004       5.4873
FRIDAY      0.034       7.0487
FUN         0.047       7.6519
BUSINESS    0.126       8.3995
NAME        0.144       8.5162
HAVE        0.267       8.9782
THIS        0.423       9.3893
A           0.596       9.5463



G.2 Estimating Frequency Moments

We estimate the ath frequency moments , for A = 1 — a = 0.2, 0. 1, 10~^^, using the proposed new estimator 
$\hat{F}_{(\alpha)}$ and the geometric mean estimator $\hat{F}_{(\alpha),gm}$, as well as the geometric mean estimator for symmetric stable
random projections proposed in [26]. Recall

H A 



a/A 



F< 



{a),gm 



r(i-f) 



n 



We find Ff^^-f is numerically very stable, if we express it as 



F 



(a) 



1 

AA 



A 



where -Fji), the first moment, can be computed exactly. Using Matlab (the 32-bit version), we find no numerical 
problems with even for very small A (e.g., A = 10^^''; see Figure[3]l. 
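One generic way to keep the estimator $\hat{F}_{(\alpha)} = \Delta^{-\Delta}\left[k / \sum_{j} x_j^{-\alpha/\Delta}\right]^{\Delta}$ stable for very small $\Delta$ is to evaluate it in log space. The sketch below uses a log-sum-exp evaluation, which is a swapped-in technique and not necessarily the expression the paper uses with the first moment. The sampling model $x_j = F_{(\alpha)}^{1/\alpha} Z_j$, with $Z_j$ generated as in Appendix C, is an assumption made only to produce test data, and all function names are illustrative.

```python
import math
import random


def sample_z(alpha, rng):
    """One draw of Z via the representation with V uniform and W exponential."""
    delta = 1.0 - alpha
    v = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * v) / math.sin(v) ** (1.0 / alpha)
            * (math.sin(v * delta) / w) ** (delta / alpha))


def estimate_moment(xs, alpha):
    """F_hat = Delta^(-Delta) * (k / sum_j x_j^(-alpha/Delta))^Delta, in log space."""
    delta = 1.0 - alpha
    k = len(xs)
    # log of each term x_j^(-alpha/Delta); the terms themselves may underflow.
    logs = [-(alpha / delta) * math.log(x) for x in xs]
    m = max(logs)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logs))   # log-sum-exp
    return math.exp(delta * (math.log(k) - log_sum) - delta * math.log(delta))


def demo(f_true=5.0, alpha=0.9, k=2000, seed=1):
    """Simulate k samples under the assumed model and return the estimate."""
    rng = random.Random(seed)
    scale = f_true ** (1.0 / alpha)   # assumed model: x = F^(1/alpha) * Z
    xs = [scale * sample_z(alpha, rng) for _ in range(k)]
    return estimate_moment(xs, alpha)


if __name__ == "__main__":
    print(demo())                           # moderate Delta = 0.1
    print(demo(alpha=1.0 - 1e-6, k=100))    # very small Delta
```

For $\Delta = 10^{-6}$ the individual terms $x_j^{-\alpha/\Delta}$ are far below the smallest representable double, so a direct evaluation would divide by zero, while the log-space version only manipulates numbers of moderate size.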



However, we could not find a numerically very stable implementation of the geometric mean estimator 
F{a).gm7 whcn A < lO^'^. We tried a variety of ways (including the tricks in implementing -^'(q)) to imple- 



ment F( 



{a),gm 



and the Gamma functions (e.g., using "gammaln" instead of "gamma" in Matlab). Fortunately, we 



believe A = 10 is sufficiently small for comparing the two estimators. 
Figure 3: Normalized MSEs for estimating the αth frequency moments using the geometric mean estimator
$\hat{F}_{(\alpha),gm}$ (left panel) and the proposed new estimator $\hat{F}_{(\alpha)}$ (middle panel) for CC, together with the geometric
mean estimator (right panel) for symmetric stable random projections. For $\hat{F}_{(\alpha),gm}$ and $\hat{F}_{(\alpha)}$, we also plot their
theoretical variances (dashed curves), which largely overlap the empirical MSEs whenever the algorithms are
numerically stable. The proposed new estimator is numerically very stable even when A = 10^^''. In 
comparison, is not stable if A < 10"^. We present results at the sample sizes k = 10, 100, and 1000. 



We experiment with three k values: 10, 100, and 1000; and we present the estimation errors in terms of the
normalized mean square errors (MSEs, normalized by the square of the true values). As Δ decreases, the MSEs for
the symmetric stable random projections (in the right panel of Figure 3) are roughly flat, verifying that algorithms
based on symmetric stable random projections do not capture the fact that the first moment (α = 1) should be a
trivial problem.

Using Compressed Counting (CC), the geometric mean estimator, $\hat{F}_{(\alpha),gm}$ (in the left panel of Figure 3), and the
proposed new estimator, $\hat{F}_{(\alpha)}$ (in the middle panel), clearly exhibit the desired property that the MSEs decrease
as Δ decreases. Of course, as expected, $\hat{F}_{(\alpha)}$ has a much faster rate of decrease than $\hat{F}_{(\alpha),gm}$; the latter is also
numerically much less stable when A < 10^^. 

G.3 Estimating Shannon Entropies 

After we have estimated the frequency moments, we use them to estimate the Shannon entropies using Tsallis
entropies. For the data vector "TWIST", we present results at sample sizes k = 3, 10, 100, 1000, and 10000. For
all other vectors, we do not experiment with k = 10000. Figure 4 and Figure 5 present the normalized MSEs.
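The step from frequency moments to Shannon entropy can be illustrated with exact (not estimated) moments. The sketch below uses the standard definition of the Tsallis entropy, $T_\alpha = \left(1 - F_{(\alpha)}/F_{(1)}^{\alpha}\right)/(\alpha - 1)$, which approaches the Shannon entropy (in nats) as $\Delta = 1 - \alpha \to 0$. The toy count vector and the function names are illustrative and not taken from the experiments.

```python
import math


def shannon_entropy(counts):
    """H = -sum_i p_i log p_i with p_i = counts[i] / sum(counts), in nats."""
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def tsallis_entropy(counts, alpha):
    """T_alpha = (1 - F_alpha / F_1^alpha) / (alpha - 1), from exact moments."""
    f1 = float(sum(counts))                              # first moment F_(1)
    f_alpha = sum(c ** alpha for c in counts if c > 0)   # alpha-th moment F_(alpha)
    return (1.0 - f_alpha / f1 ** alpha) / (alpha - 1.0)


if __name__ == "__main__":
    counts = [2, 1, 1]
    print("Shannon:", shannon_entropy(counts))
    for delta in (1e-1, 1e-2, 1e-4):
        print(delta, tsallis_entropy(counts, 1.0 - delta))
```

The bias of $T_\alpha$ shrinks roughly linearly in $\Delta$, while any error in the estimated moment is divided by $\Delta$; this is why the accuracy of the moment estimator at small $\Delta$ matters so much.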

Using CC and the proposed estimator $\hat{F}_{(\alpha)}$ (middle panels), only k = 10 samples already produce fairly
accurate estimates. In fact, for some vectors (such as "A"), even k = 3 may provide reasonable estimates. We
believe the performance of the new estimator is remarkable. Another nice property is that the estimation errors
(MSEs) become stable after (e.g.,) A < 10"^ (or 10""'). 

In comparison, the performance of the geometric mean estimator (left panels) for CC is not satisfactory. This
is because its variance decreases only at the rate of O(Δ), not O(Δ²).

Also clearly, using symmetric stable random projections (right panels) would not provide good estimates of 
the Shannon entropy (unless the sample size is extremely large (^ 10000) and one could carefully choose a good 
A to exploit the bias-variance trade-off). 
Figure 4: Normalized MSEs for estimating Shannon entropies using $\hat{F}_{(\alpha),gm}$ (left panels) and $\hat{F}_{(\alpha)}$ (middle
panels) for CC, and the geometric mean estimator for symmetric stable random projections (right panels).

Figure 5: Normalized MSEs for estimating Shannon entropies using $\hat{F}_{(\alpha),gm}$ (left panels) and $\hat{F}_{(\alpha)}$ (middle
panels) for CC, and the geometric mean estimator for symmetric stable random projections (right panels).